{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Weak Supervision with Label Studio"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perform weak supervision with Label Studio, using noisy labels on a large amount of training data to automatically build a useful training dataset for your supervised learning model. \n",
    "\n",
    "In this example, use the [Label Studio SDK](https://labelstud.io/sdk/index.html) to write a Python script that adds noisy labels to tasks, then adds those tasks to Label Studio for review and correction in a supervised learning setting.\n",
    "\n",
    "**Note:** This code utilizes functions from an older version of the Label Studio SDK (v0.0.34).\n",
    "The newer versions v1.0 and above still support the functionalities of the old version, but you will need to specify\n",
    "[`label_studio_sdk._legacy`](../../README.md) in your script."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to Label Studio\n",
    "\n",
    "Connect to the API for Label Studio Community, Enterprise, or Teams edition. Use the Client module of the Label Studio SDK and check the connection is working:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from label_studio_sdk import Client\n",
    "\n",
    "ls = Client(url='http://localhost:8000', api_key='d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3')\n",
    "ls.check_connection()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a project\n",
    "\n",
    "Create a simple text classification project to perform [sentiment analysis](https://labelstud.io/templates/sentiment_analysis.html) to identify the sentiment expressed by a passage of text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = ls.start_project(\n",
    "    title='Weak Supervision example with SDK',\n",
    "    label_config='''\n",
    "    <View>\n",
    "    <Text name=\"text\" value=\"$text\"/>\n",
    "    <View style=\"box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;\">\n",
    "        <Header value=\"Choose text sentiment\"/>\n",
    "        <Choices name=\"sentiment\" toName=\"text\" choice=\"single\" showInLine=\"true\">\n",
    "            <Choice value=\"Positive\"/>\n",
    "            <Choice value=\"Negative\"/>\n",
    "            <Choice value=\"Neutral\"/>\n",
    "        </Choices>\n",
    "    </View>\n",
    "    </View>\n",
    "    '''\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import tasks\n",
    "\n",
    "Import small text samples into Label Studio, and retrieve their task IDs. This examples uses a small subset of tasks from the available Amazon Review dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "tasks = pd.read_csv('data/amazon_cells_labelled.tsv', sep='\\t').to_dict('records')\n",
    "tasks_ids = project.import_tasks(tasks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create noisy predictions\n",
    "\n",
    "Perform programmatic labeling to create weakly supervised annotations for the text samples. Our labeling operations, or in shorthand, **LabelOps**, are noisy programmatic labelers that reflect subject matter experts' domain knowledge in a simple pattern-to-class mapping form. In this example, assigning a sentiment class based on specific key words in the Amazon review. In more complex scenarios, the noisy labeling could be performed on the output of a learned classifier using confidence scores, crowdsourced labels, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re, random\n",
    "\n",
    "# Noisy programmatic labelers\n",
    "label_ops = {\n",
    "    r'.*\\b(good|excellent|great|cool)': 'Positive',\n",
    "    r'.*\\bi\\s+like': 'Positive',\n",
    "    r'.*\\bnot': 'Negative',\n",
    "    r'.*\\bdisappointed': 'Negative',\n",
    "    r'.*\\bjunk': 'Negative'\n",
    "}\n",
    "\n",
    "# Pre-annotations in Label Studio JSON format\n",
    "predictions = []\n",
    "for label_regex, label in label_ops.items():\n",
    "    model_version = label_regex\n",
    "    for task, task_id in zip(tasks, tasks_ids):\n",
    "        text = task['text'].lower()\n",
    "        if re.match(label_regex, text):\n",
    "            predictions.append({\n",
    "                'task': task_id,\n",
    "                'result': [{\n",
    "                    'from_name': 'sentiment',\n",
    "                    'to_name': 'text',\n",
    "                    'type': 'choices',\n",
    "                    'value': {\n",
    "                        'choices': [label]\n",
    "                    }\n",
    "                }],\n",
    "                'score': random.random(),\n",
    "                'model_version': model_version\n",
    "            })\n",
    "\n",
    "project.create_predictions(predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check quality of different models\n",
    "\n",
    "For each programmatic labeler used, retrieve relevant statistics such as \n",
    "- `get_predictions_coverage()` - fraction of dataset covered by specific prediction/LabelOp)\n",
    "- `get_predictions_conflict()` - the amount of conflicts for a specific LabelOp (coming soon)\n",
    "- `get_predictions_precision()` - precision computed over hold-out ground truth dataset (assuming predictions cover this dataset) (coming soon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_versions = project.get_model_versions()\n",
    "\n",
    "# check model version stats\n",
    "pd.Series(project.get_predictions_coverage(), name='Coverage')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create annotations from specific model versions\n",
    "\n",
    "Based on quality metrics from previous steps, select a subset of high-performing programmatic labelers to use, then combine the relevant predictions into annotations for the relevant tasks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(project.create_annotations_from_predictions(model_versions=list(model_versions)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "After performing programmatic noisy labeling on a dataset, you can evaluate the quality of the predictions programmatically. Then, using the Label Studio SDK, you can transform the best quality predictions into annotations to train a weakly supervised model.\n",
    "\n",
    "If you want, you can also take the most confusing items in the dataset for the programmatic labelers and [import pre-annotations into Label Studio](https://github.com/heartexlabs/label-studio-sdk/blob/master/examples/Import%20preannotations.ipynb) for human-in-the-loop annotator review and correction.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
